Distances between sequences based on their $k$-mer frequency counts can be used to reconstruct phylogenies without first computing a sequence alignment. Past work has shown that effective use of $k$-mer methods depends on 1) model-based corrections to distances based on $k$-mers and 2) breaking long sequences into blocks to obtain repeated trials from the sequence-generating process. Good performance of such methods depends on having many high-quality blocks with many homologous sites, which can be difficult to guarantee a priori. Nature provides a natural partition of sequences into homologous regions, namely the genes. However, directly applying past work in this setting is problematic because of possible discordance between different gene trees and the underlying species tree. Using the multispecies coalescent model as a basis, we derive model-based moment formulas that involve the divergence times and the coalescent parameters. From this setting, we prove identifiability results for the tree and branch length parameters under the Jukes-Cantor model of sequence mutations.
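As a rough illustration of the raw quantity the abstract starts from, the sketch below computes an uncorrected squared Euclidean distance between $k$-mer count vectors and averages it over paired blocks such as genes. It is a minimal sketch under stated assumptions, not the paper's method: the function names are hypothetical, the alphabet is assumed to be A, C, G, T, and neither the model-based correction nor the multispecies coalescent moment formulas are implemented here.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping k-mers in seq, as a vector over all 4**k DNA k-mers."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed lexicographic ordering over A, C, G, T (an assumption of this sketch).
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def kmer_distance(seq_a, seq_b, k):
    """Uncorrected squared Euclidean distance between k-mer count vectors."""
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_block_distance(blocks_a, blocks_b, k):
    """Average the uncorrected distance over paired blocks (e.g. genes)."""
    dists = [kmer_distance(a, b, k) for a, b in zip(blocks_a, blocks_b)]
    return sum(dists) / len(dists)
```

Here each entry of `blocks_a` and `blocks_b` would hold one gene from each of two taxa. Averaging over genes plays the role of the repeated trials mentioned above, and any model-based correction would be applied to the resulting value.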